# Pipelining and Superscalar Techniques This chapter deals with pipelining and superscalar design in processor development. We begin with a discussion of conventional linear pipelines and analyze their performance. A generalized pipeline model is introduced to include nonlinear interstage connections. Collision-free scheduling techniques are described for performing dynamic functions. Specific techniques for building instruction pipelines, arithmetic pipelines, and memory-access pipelines are presented. The discussion includes instruction prefetching, internal data forwarding, software interlocking, hardware scoreboarding, hazard avoidance, branch handling, and instruction-issuing techniques. Both static and multifunctional arithmetic pipelines are designed. Superscalar design techniques are studied along with performance analysis. ## LINEAR PIPELINE PROCESSORS A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. In modern computers, linear pipelines are applied for instruction execution, arithmetic computation, and memory-access operations. ## 6.1.1 Asynchronous and Synchronous Models A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage $S_1$ . The processed results are passed from stage $S_i$ to stage $S_{i+1}$ , for all i = 1, 2, ..., k-1. The final result emerges from the pipeline at the last stage $S_k$ . Depending on the control of data flow along the pipeline, we model linear pipelines in two categories: asynchronous and syncnronous. **Asynchronous Model** As shown in Fig. 6.1a, data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol. When stage $S_i$ is ready to transmit, it sends a *ready* signal to stage $S_{i+1}$ . After stage $S_{i+1}$ receives the incoming data, it returns an *acknowledge* signal to $S_i$ . Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers where pipelined wormhole routing is practiced (see Chapter 9). Asynchronous pipelines may have a variable throughput rate. Different amounts of delay may be experienced in different stages. Fig. 6.1 Two models of linear pipeline units and the corresponding reservation table **Synchronous Model** Synchronous pipelines are illustrated in Fig. 6.1b. Clocked latches are used to interface between stages. The latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. The pipeline stages are combinational logic circuits. It is desired to have approximately equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline. Unless otherwise specified, only synchronous pipelines are studied in this book. The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table. For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.1c. This table is essentially a space-time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline. Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled up, one result emerges from the pipeline for each additional cycle. This throughput is sustained only if the successive tasks are independent of each other. #### 6.1.2 Clocking and Timing Control The clock cycle $\tau$ of a pipeline is determined below. Let $\tau_i$ be the time delay of the circuitry in stage $S_i$ and d the time delay of a latch, as shown in Fig. 6.1b. Clock Cycle and Throughput Denote the maximum stage delay as $\tau_m$ , and we can write $\tau$ as $$\tau = \max_{i} \left\{ \tau_{i} \right\}_{1}^{k} + d = \tau_{m} + d \tag{6.1}$$ At the rising edge of the clock pulse, the data is latched to the master flip-flops of each latch register. The clock pulse has a width equal to d. In general, $\tau_m >> d$ by one to two orders of magnitude. This implies that the maximum stage delay $\tau_m$ dominates the clock period. The pipeline frequency is defined as the inverse of the clock period: $$f = \frac{1}{\tau} \tag{6.2}$$ If one result is expected to come out of the pipeline per cycle, frepresents the maximum throughput of the pipeline. Depending on the initiation rate of successive tasks entering the pipeline, the actual throughput of the pipeline may be lower than f. This is because more than one clock cycle has elapsed between successive task initiations. **Clock Skewing** Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time. However, due to a problem known as *clock skewing*, the same clock pulse may arrive at different stages with a time offset of s. Let $t_{max}$ be the time delay of the longest logic path within a stage and $t_{min}$ that of the shortest logic path within a stage. To avoid a race in two successive stages, we must choose $\tau_m \ge t_{max} + s$ and $d \le t_{min} - s$ . These constraints translate into the following bounds on the clock period when clock skew takes effect: $$d + t_{max} + s \le \tau \le \tau_m + t_{min} - s \tag{6.3}$$ In the ideal case s = 0, $t_{max} = \tau_m$ , and $t_{min} = d$ . Thus, we have $\tau = \tau_m + d$ , consistent with the definition in Eq. 6.1 without the effect of clock skewing. #### 6.1.3 Speedup, Efficiency, and Throughput Ideally, a linear pipeline of k stages can process n tasks in k + (n-1) clock cycles, where k cycles are needed to complete the execution of the very first task and the remaining n-1 tasks require n-1 cycles. Thus the total time required is $$T_k = [k + (n-1)]\tau (6.4)$$ where $\tau$ is the clock period. Consider an equivalent-function nonpipelined processor which has a flow-through delay of $k\tau$ . The amount of time it takes to execute n tasks on this nonpipelined processor is $T_1 = nk\tau$ . **Speedup Factor** The speedup factor of a k-stage pipeline over an equivalent non pipelined processor is defined as $$S_k = \frac{T_1}{T_k} = \frac{nk\tau}{k\tau + (n-1)\tau} = \frac{nk}{k + (n-1)}$$ (6.5) #### Note 6.1 Pipelined versus non-pipelined processors If each pipeline stage has a stage delay of $\tau$ , then clearly an instruction passing through k pipeline stages in a processor sees a total latency of $k\tau$ . Now suppose we also have a non-pipelined processor for the same instruction set, using the same technology. This non-pipelined processor need not present a latency of $k\tau$ to every instruction, because it does not have k separate stages for an instruction to pass through. Since the non-pipelined processor would have a more compact hardware design, we can expect that the average latency seen by instructions on this processor will be smaller than $k\tau$ . In other words, the advantage of a pipelined processor lies in its instruction throughput; in terms of instruction latency, the non-pipelined version can in fact be expected do better. However, for the comparative analysis here, we have assumed that the instruction latency on the non-pipelined version is also $k\tau$ . This is a simplification which does not change substantially the conclusion reached. # Example 6.1 Pipeline speedup versus stream length The maximum speedup is $S_k \to k$ as $n \to \infty$ . This maximum speedup is very difficult to achieve because of data dependences between successive tasks (instructions), program branches, interrupts, and other factors to be studied in subsequent sections. Figure 6.2a plots the speedup factor as a function of n, the number of tasks (operations or instructions) performed by the pipeline. For small values of n, the speedup can be very poor. The smallest value of $S_k$ is 1 when n = 1. The larger the number k of subdivided pipeline stages, the higher the potential speedup performance. When n = 64, an eight-stage pipeline has a speedup value of 7.1 and a four-stage pipeline has a speedup of 3.7. However, the number of pipeline stages cannot increase indefinitely due to practical constraints on costs, control complexity, circuit implementation, and packaging limitations. Furthermore, the stream length n also affects the speedup; the longer the better in using a pipeline. **Optimal Number of Stages** In practice, most pipelining is staged at the functional level with $2 \le k \le 15$ . Very few pipelines are designed to exceed 10 stages in real computers. The optimal choice of the number of pipeline stages should be able to maximize the performance/cost ratio for the target processing load. Let t be the total time required for a nonpipelined sequential program of a given function. To execute the same program on a k-stage pipeline with an equal flow-through delay t, one needs a clock period of p = t/k + d, where d is the latch delay. Thus, the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total pipeline cost is roughly estimated by c + kh, where c covers the cost of all logic stages and h represents the cost of each latch. A pipeline performance/cost ratio (PCR) has been defined by Larson (1973): $$PCR = \frac{f}{c + kh} = \frac{1}{(t/k + d)(c + kh)}$$ (6.6) (a) Speedup factor as a function of the number of operations (Eq. 6.5) (b) Optimal number of pipeline stages (Eqs. 6.6 and 6.7) Fig. 6.2 Speedup factors and the optimal number of pipeline stages for a linear pipeline unit Figure 6.2b plots the PCR as a function of k. The peak of the PCR curve corresponds to an optimal choice for the number of desired pipeline stages: $$k_0 = \sqrt{\frac{t \cdot c}{d \cdot h}} \tag{6.7}$$ where t is the total flow-through delay of the pipeline. Thus the total stage cost c, the latch delay d, and the latch cost h must be considered to achieve the optimal value $k_0$ . **Efficiency and Throughput** The efficiency $E_k$ of a linear k-stage pipeline is defined as $$E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)} \tag{6.8}$$ Obviously, the efficiency approaches 1 when $n \to \infty$ , and a lower bound on $E_k$ is 1/k when n = 1. The pipeline throughput $H_k$ is defined as the number of tasks (operations) performed per unit time: $$H_k = \frac{n}{[k + (n-1)]\tau} = \frac{nf}{k + (n-1)}$$ (6.9) The maximum throughput f occurs when $E_k \to 1$ as $n \to \infty$ . This coincides with the speedup definition given in Chapter 3. Note that $H_k = E_{k'}$ $f = E_k/\tau = S_k/k\tau$ . Other relevant factors of instruction pipelines will be discussed in Chapters 12 and 13. ## .2 NONLINEAR PIPELINE PROCESSORS A dynamic pipeline can be reconfigured to perform variable functions at different times. The traditional linear pipelines are static pipelines because they are used to perform fixed functions. A dynamic pipeline allows feedforward and feedback connections in addition to the streamline connections. For this reason, some authors call such a structure a *nonlinear pipeline*. ## 6.2.1 Reservation and Latency Analysis In a static pipeline, it is relatively easy to partition a given function into a sequence of linearly ordered subfunctions. However, function partitioning in a dynamic pipeline becomes quite involved because the pipeline stages are interconnected with loops in addition to streamline connections. A multifunction dynamic pipeline is shown in Fig. 6.3a. This pipeline has three stages. Besides the streamline connections from $S_1$ to $S_2$ and from $S_2$ to $S_3$ , there is a feed forward connection from $S_1$ to $S_3$ and two feedback connections from $S_3$ to $S_2$ and from $S_3$ to $S_1$ . These feedforward and feedback connections make the scheduling of successive events into the pipeline a nontrivial task. With these connections, the output of the pipeline is not necessarily from the last stage. In fact, following different dataflow patterns, one can use the same pipeline to evaluate different functions. Fig. 6.3 A dynamic pipeline with feed forward and feedback connections for two different functions **Reservation Tables** The reservation table for a static linear pipeline is trivial in the sense that dataflow follows a linear streamline. The *reservation table* for a dynamic pipeline becomes more interesting because a nonlinear pattern is followed. Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of different functions. Two reservation tables are given in Figs. 6.3b and 6.3c, corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table. A static pipeline is specified by a single reservation table. A dynamic pipeline may be specified by more than one reservation table. Each reservation table displays the time-space flow of data through the pipeline for one function evaluation. Different functions follow different paths through the pipeline. The number of columns in a reservation table is called the *evaluation time* of a given function. For example, the function X requires eight clock cycles to evaluate, and function Y requires six cycles, as shown in Figs. 6.3b and 6.3c, respectively. A pipeline *initiation* table corresponds to each function evaluation. All initiations to a static pipeline use the same reservation table. On the other hand, a dynamic pipeline may allow different initiations to follow a mix of reservation tables. The checkmarks in each row of the reservation table correspond to the time instants (cycles) that a particular stage will be used. There may be multiple checkmarks in a row, which means repeated usage of the same stage in different cycles. Contiguous checkmarks in a row simply imply the extended usage of a stage over more than one cycle. Multiple checkmarks in a column mean that multiple stages need to be used in parallel during a particular clock cycle. **Latency Analysis** The number of time units (clock cycles) between two initiations of a pipeline is *the latency* between them. Latency values must be nonnegative integers. A latency of k means that two initiations are separated by k clock cycles. Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a *collision*. A collision implies resource conflicts between two initiations in the pipeline. Therefore, all collisions must be avoided in scheduling a sequence of pipeline initiations. Some latencies will cause collisions, and some will not. Latencies that cause collisions are called *forbidden latencies*. In using the pipeline in Fig. 6.3 to evaluate the function X, latencies 2 and 5 are forbidden, as illustrated in Fig. 6.4. Fig. 6.4 Collisions with forbidden latencies 2 and 5 in using the pipeline in Fig. 6.3 to evaluate the function X The *i*th initiation is denoted as $X_i$ in Fig. 6.4. With latency 2, initiations $X_1$ and $X_2$ collide in stage 2 at time 4. At time 7, these initiations collide in stage 3. Similarly, other collisions are shown at times 5, 6, 8, ..., etc. 234 The collision patterns for latency 5 are shown in Fig. 6.4b, where $X_1$ and $X_2$ are scheduled 5 clock cycles apart. Their first collision occurs at time 6. To detect a forbidden latency, one needs simply to check the distance between any two checkmarks in the same row of the reservation table. For example, the distance between the first mark and the second mark in row $S_1$ in Fig. 6.3b is 5, implying that 5 is a forbidden latency. Similarly, latencies 2, 4, 5, and 7 are all seen to be forbidden from inspecting the same reservation table. From the reservation table in Fig. 6.3c, we discover the forbidden latencies 2 and 4 for function Y. A *latency* sequence is a sequence of permissible nonforbidden latencies between successive task initiations. A latency cycle is a latency sequence which repeats the same subsequence (cycle) indefinitely. Figure 6.5 illustrates latency cycles in using the pipeline in Fig. 6.3 to evaluate the function X without causing a collision. For example, the latency cycle (1, 8) represents the infinite latency sequence 1, 8, 1, 8, 1, 8, .... This implies that successive initiations of new tasks are separated by one cycle and eight cycles alternately. Fig. 6.5 Three valid latency cycles for the evaluation of function X The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of latencies along the cycle. The latency cycle (1, 8) thus has an average latency of (1 + 8)/2 = 4.5. A constant cycle is a latency cycle which contains only one latency value. Cycles (3) and (6) in Figs. 6.5b and 6.5c are both constant cycles. The average latency of a constant cycle is simply the latency itself. In the next section, we describe how to obtain these latency cycles systematically. #### 6.2.2 Collision-Free Scheduling When scheduling events in a nonlinear pipeline, the main objective is to obtain the shortest average latency between initiations without causing collisions. In what follows, we present a systematic method for achieving such collision-free scheduling. We study below collision vectors, state diagrams, single cycles, greedy cycles, and minimal average latency (MAL). This pipeline design theory was originally developed by Davidson (1971) and his students. **Collision Vectors** By examining the reservation table, one can distinguish the set of permissible latencies from the set of forbidden latencies. For a reservation table with n columns, the maximum forbidden latency $m \le n - 1$ . The permissible latency p should be as small as possible. The choice is made in the range $1 \le p \le m - 1$ . A permissible latency of p = 1 corresponds to the ideal case. In theory, a latency of 1 can always be achieved in a static pipeline which follows a linear (diagonal or streamlined) reservation table as shown in Fig. 6.1c. The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an *m*-bit binary vector $C = (C_m C_{m-1} \dots C_2 C_1)$ . The value of $C_i = 1$ if latency *i* causes a collision and $C_i = 0$ if latency *i* is permissible. Note that it is always true that $C_m = 1$ , corresponding to the maximum forbidden latency. For the two reservation tables in Fig. 6.3, the collision vector $C_X = (1011010)$ is obtained for function X, and $C_Y = (1010)$ for function Y. From $C_X$ , we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and latencies 6, 3, and 1 are permissible. Similarly, 4 and 2 are forbidden latencies and 3 and 1 are permissible latencies for function Y. **State Diagrams** From the above collision vector, one can construct a *state diagram* specifying the permissible state transitions among successive initiations. The collision vector, like $C_X$ above, corresponds to the *initial state* of the pipeline at time 1 and thus is called an *initial collision vector*. Let p be a permissible latency within the range $1 \le p \le m-1$ . The *next state* of the pipeline at time t + p is obtained with the assistance of an *m*-bit right shift register as in Fig. 6.6a. The initial collision vector C is initially loaded into the register. The register is then shifted to the right. Each 1-bit shift corresponds to an increase in the latency by 1. When a 0 bit emerges from the right end after p shifts, it means p is a permissible latency. Likewise, a 1 bit being shifted out means a collision, and thus the corresponding latency should be forbidden. Logical 0 enters from the left end of the shift register. The next state after p shifts is thus obtained by bitwise-ORing the initial collision vector with the shifted register contents. For example, from the initial state $C_X = (1011010)$ , the next state (1111111) is reached after one right shift of the register, and the next state (1011011) is reached after three shifts or six shifts. # Example 6.2 The state transition diagram for a pipeline unit A state diagram is obtained in Fig. 6.6b for function X. From the initial state (1011010), only three outgoing transitions are possible, corresponding to the three permissible latencies 6, 3, and 1 in the initial collision vector. Similarly, from state (1011011), one reaches the same state after either three shifts or six shifts. 236 When the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state. For example, after eight or more (denoted as $8^+$ ) shifts, the next state must be the initial state, regardless of which state the transition starts from. In Fig. 6.6c, a state diagram is obtained for the reservation table in Fig. 6.3c using a 4-bit shift register. Once the initial collision vector is determined, the corresponding state diagram is uniquely determined. (a) State transition using an n-bit right shift register, where n is the maximum forbidden latency Fig. 6.6 Two state diagrams obtained from the two reservation tables in Fig. 6.3, respectively The 0's and 1's in the present state, say at time t, of a state diagram indicate the permissible and forbidden latencies, respectively, at time t. The bitwise ORing of the shifted version of the present state with the initial collision vector is meant to prevent collisions from future initiations starting at time t+1 and onward. Thus the state diagram covers all permissible state transitions that avoid collisions. All latencies equal to or greater than m are permissible. This implies that collisions can always be avoided if events are scheduled far apart (with latencies of $m^+$ ). However, such long latencies are not tolerable from the viewpoint of pipeline throughput. **Greedy Cycles** From the state diagram, we can determine optimal latency cycles which result in the MAL. There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), (3, 8), (3, 6, 3) ..., are legitimate cycles traced from the state diagram in Fig. 6.6b. Among these cycles, only *simple cycles* are of interest. A simple cycle is a latency cycle in which each state appears only once. In the state diagram in Fig. 6.6b, only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple because it travels through the state (1011010) twice. Similarly, the cycle (3, 6, 3, 8, 6) is not simple because it repeats the state (1011011) three times. Some of the simple cycles are *greedy cycles*. A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example, in Fig. 6.6b the cycles (1, 8) and (3) are greedy cycles. Greedy cycles in Fig. 6.6c are (1, 5) and (3). Such cycles must first be simple, and their average latencies must be lower than those of other simple cycles. The greedy cycle (1, 8) in Fig. 6.6b has an average latency of (1 + 8)/2 = 4.5, which is lower than that of the simple cycle (6, 8) = (6 + 8)/2 = 7. The greedy cycle (3) has a constant latency which equals the MAL for evaluating function X without causing a collision. The MAL in Fig. 6.6c is 3, corresponding to either of the two greedy cycles. The minimum-latency edges in the state diagrams are marked with asterisks. At least one of the greedy cycles will lead to the MAL. The collision-free scheduling of pipeline events is thus reduced to finding greedy cycles from the set of simple cycles. The greedy cycle yielding the MAL is the final choice. #### 6.2.3 Pipeline Schedule Optimization An optimization technique based on the MAL is given below. The idea is to insert noncompute delay stages into the original pipeline. This will modify the reservation table, resulting in a new collision vector and an improved state diagram. The purpose is to yield an optimal latency cycle, which is absolutely the shortest. **Bounds on the MAL** In 1972, Shar determined the following bounds on the *minimal average latency* (MAL) achievable by any control strategy on a statically reconfigured pipeline executing a given reservation table: - (1) The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table. - (2) The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram. - (3) The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1. This is also an upper bound on the MAL. Interested readers may refer to Shar (1972) or find proofs of these bounds in Kogge (1981). These results suggest that the optimal latency cycle must be selected from one of the lowest greedy cycles. However, a greedy cycle is not sufficient to guarantee the optimality of the MAL. The lower bound guarantees the optimality. For example, the MAL = 3 for both function X and function Y and has met the lower bound of 3 from their respective reservation tables. From Fig. 6.6b, the upper bound on the MAL for function X is equal to 4 + 1 = 5, a rather loose bound. On the other hand, Fig. 6.6c shows a rather tight upper bound of 2 + 1 = 3 on the MAL. Therefore, all greedy cycles for function Y lead to the optimal latency value of 3, which cannot be lowered further. To optimize the MAL, one needs to find the lower bound by modifying the reservation table. The approach is to reduce the maximum number of checkmarks in any row. The modified reservation table must preserve the original function being evaluated. Patel and Davidson (1976) have suggested the use of noncompute delay stages to increase pipeline performance with a shorter MAL. Their technique is described below. **Delay Insertion** The purpose of delay insertion is to modify the reservation table, yielding a new collision vector. This leads to a modified state diagram, which may produce greedy cycles meeting the lower bound on the MAL. Before delay insertion, the three-stage pipeline in Fig. 6.7a is specified by the reservation table in Fig. 6.7b. This table leads to a collision vector C = (1011), corresponding to forbidden latencies 1, 2, and 4. The corresponding state diagram (Fig. 6.7c) contains only one self-reflecting state with a greedy cycle of latency 3 equal to the MAL. Based on the given reservation table, the maximum number of checkmarks in any row is 2. Therefore, the MAL = 3 so obtained in Fig. 6.7c is not optimal. # Example 6.3 Inserting noncompute delays to reduce the MAL To insert a noncompute stage $D_1$ after stage $S_3$ will delay both $X_1$ and $X_2$ operations one cycle beyond time 4. To insert yet another noncompute stage $D_2$ after the second usage of $S_1$ will delay the operation $X_2$ by another cycle. These delay operations, as grouped in Fig. 6.7b, result in a new pipeline configuration in Fig. 6.8a. Both delay elements $D_1$ and $D_2$ are inserted as extra stages, as shown in Fig. 6.8b with an enlarged reservation table having 3 + 2 = 5 rows and 5 + 2 = 7 columns. Fig. 6.7 A pipline with a minimum average latency of 3 In total, the operation $X_1$ has been delayed one cycle from time 4 to time 5 and the operation $X_2$ has been delayed two cycles from time 5 to time 7. All remaining operations (marked as X in Fig. 6.8b) are unchanged. This new table leads to a new collision vector (100010) and a modified state diagram in Fig. 6.8c. (a) Insertion of two noncompute delay stages (b) Modified reservation table (c) Modified state diagram with a reduced MAL = (1 + 3)/2 = 2 Fig. 6:8 Insertion of two delay stages to obtain an optimal MAL for the pipeline in Fig. 6.7 This diagram displays a greedy cycle (1, 3), resulting in a reduced MAL = (1 + 3)/2 = 2. The delay insertion thus improves the pipeline performance, yielding a lower bound for the MAL. **Pipeline Throughput** This is essentially the initiation rate or the average number of task initiations per clock cycle. If N tasks are initiated within n pipeline cycles, then the *initiation rate* or *pipeline throughput* is measured as N/n. This rate is determined primarily by the inverse of the MAL adapted. Therefore, the scheduling strategy does affect the pipeline performance. In general, the shorter the adapted MAL, the higher the throughput that can be expected. The highest achievable throughput is one task initiation per cycle, when the MAL equals 1 since $1 \le MAL \le$ the shortest latency of any greedy cycle. Unless the MAL is reduced to 1, the pipeline throughput becomes a fraction. **Pipeline Efficiency** Another important measure is *pipeline efficiency*. The percentage of time that each pipeline stage is used over a sufficiently long series of task initiations is the *stage utilization*. The accumulated rate of all stage utilizations determines the pipeline efficiency. Let us reexamine latency cycle (3) in Fig. 6.5b. Within each latency cycle of three clock cycles, there are two pipeline stages, $S_1$ and $S_3$ , which are completely and continuously utilized after time 6. The pipeline stage $S_2$ is used for two cycles and is idle for one cycle. Therefore, the entire pipeline can be considered 8/9 = 88.8% efficient for latency cycle (3). On the other hand, the pipeline is only 14/27 = 51.8% efficient for latency cycle (1, 8) and 8/16 = 50% efficient for latency cycle (6), as illustrated in Figs. 6.5a and 6.5c, respectively. Note that none of the three stages is fully utilized with respect to two initiation cycles. The pipeline throughput and pipeline efficiency are related to each other. Higher throughput results from a shorter latency cycle. Higher efficiency implies less idle time for pipeline stages. The above example demonstrates that higher throughput also accompanies higher efficiency. Other examples however may show a contrary conclusion. The relationship between the two measures is a function of the reservation table and of the initiation cycle adopted. At least one stage of the pipeline should be fully (100%) utilized at the steady state in any acceptable initiation cycle; otherwise, the pipeline capability has not been fully explored. In such cases, the initiation cycle may not be optimal and another initiation cycle should be examined for improvement. ### INSTRUCTION PIPELINE DESIGN A stream of instructions can be executed by a pipeline in an overlapped manner. We describe below instruction pipelines for CISC and RISC scalar processors. Topics to be studied include instruction prefetching, data forwarding, hazard avoidance, interlocking for resolving data dependences, dynamic instruction scheduling, and branch handling techniques for improving pipelined processor performance. Further discussion on instruction level parallelism will be found in Chapter 12. ### 6.3.1 Instruction Execution Phases A typical instruction execution consists of a sequence of operations, including instruction fetch, decode, operand fetch, execute, and write-back phases. These phases are ideal for overlapped execution on a linear pipeline. **Pipelined Instruction Processing** A typical instruction pipeline is depicted in Fig. 6.9. The *fetch stage* (F) fetches instructions from a cache memory, ideally one per cycle. The *decode stage* (D) reveals the instruction function to be performed and identifies the resources needed. Resources include general-purpose registers, buses, and functional units. The *issue stage* (I) reserves resources. The operands are also read from registers during the issue stage. The instructions are executed in one or several execute stages (E). Three execute stages are shown in Fig. 6.9a. The last writeback stage (W) is used to write results into the registers. Memory load or store operations are treated as part of execution. Figure 6.9 shows the flow of machine instructions through a typical pipeline. These eight instructions are for pipelined execution of the high-level language statements X = Y + Z and $A = B \times C$ . Here we have assumed that load and store instructions take four execution clock cycles, while floating-point add and multiply operations take three cycles. The above timing assumptions represent typical values found in an older CISC processor. In many RISC processors, fewer clock cycles are needed. On the other hand, Cray 1 required 11 cycles for a load and a floating-point addition took six. With in-order instruction issuing, if an instruction is blocked from issuing due to a data or resource dependence, all instructions following it are blocked. Figure 6.9b illustrates the issue of instructions following the original program order. The shaded boxes correspond to idle cycles when instruction issues are blocked due to resource latency or conflicts or due to data dependences. The first two *load* instructions issue on consecutive cycles. The *add* is dependent on both *loads* and must wait three cycles before the data (Y and Z) are loaded in. Similarly, the *store* of the sum to memory location X must wait three cycles for the *add* to finish due to a flow dependence. There are similar blockages during the calculation of A. The total time required is 17 clock cycles. This time is measured beginning at cycle 4 when the first instruction starts execution until cycle 20 when the last instruction starts execution. This timing measure eliminates the undue effects of the pipeline "startup" or "draining" delays. Figure 6.9c shows an improved timing after the instruction issuing order is changed to eliminate unnecessary delays due to dependence. The idea is to issue all four *load* operations in the beginning. Both the add and multiply instructions are blocked fewer cycles due to this data prefetching. The reordering should not change the end results. The time required is being reduced to 11 cycles, measured from cycle 4 to cycle 14. Fig. 6.9 Pipelined execution of X = Y + Z and A = B × C (Courtesy of James Smith; reprinted with permission from IEEE Computer, July 1989) # Example 6.4 The MIPS R4000 instruction pipeline The MIPS R4000 was a pipelined 64-bit processor using separate instruction and data caches and an eight-stage pipeline for executing register-based instructions. As illustrated in Fig. 6.10, the processor pipeline design was targeted to achieve an execution rate approaching one instruction per cycle. (b) R4000 instruction overlapping in pipeline Fig. 6.10 The architecture of the MIPS R4000 instruction pipeline (Courtesy of MIPS Computer Systems) The execution of each R4000 instruction consisted of eight major steps as summarized in Fig. 6.10a. Each of these steps required approximately one clock cycle. The instruction and data memory references are split across two stages. The single-cycle ALU stage took slightly more time than each of the cache access stages. The overlapped execution of successive instructions is shown in Fig. 6.10b. This pipeline operated efficiently because different CPU resources, such as address and bus access, ALU operations, register accesses, and so on, were utilized simultaneously on a noninterfering basis. The internal pipeline clock rate (100 MHz) of the R4000 was twice the external input or master clock frequency. Figure 6.10b shows the optimal pipeline movement, completing one instruction every internal clock cycle. Load and branch instructions introduce extra delays. ### 6.3.2 Mechanisms for Instruction Pipelining We introduce instruction buffers and describe the use of cacheing, collision avoidance, multiple functional units, register tagging, and internal forwarding to smooth pipeline flow and to remove bottlenecks and unnecessary memory access operations. **Prefetch Buffers** Three types of buffers can be used to match the instruction fetch rate to the pipeline consumption rate. In one memory-access time, a block of consecutive instructions are fetched into a prefetch buffer as illustrated in Fig. 6.11. The block access can be achieved using interleaved memory modules or using a cache to shorten the effective memory-access time as demonstrated in the MIPS R4000. Fig. 6.11 The use of sequential and target buffers Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining. Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining. Both buffers operate in a first-in-first-out fashion. These buffers become part of the pipeline as additional stages. A conditional branch instruction causes both sequential buffers and target buffers to fill with instructions. After the branch condition is checked, appropriate instructions are taken from one of the two buffers, and instructions in the other buffer are discarded. Within each pair, one can use one buffer to load instructions from memory and use another buffer to feed instructions into the pipeline. The two buffers in each pair alternate to prevent a collision between instructions flowing into and out of the pipeline. A third type of prefetch buffer is known as a *loop buffer*. This buffer holds sequential instructions contained in a small loop. The loop buffers are maintained by the fetch stage of the pipeline. Prefetched instructions in the loop body will be executed repeatedly until all iterations complete execution. The loop buffer operates in two steps. First, it contains instructions sequentially ahead of the current instruction. This saves the instruction fetch time from memory. Second, it recognizes when the target of a branch falls within the loop boundary. In this case, unnecessary memory accesses can be avoided if the target instruction is already in the loop buffer. The CDC 6600 and Cray 1 made use of loop buffers. **Multiple Functional Units** Sometimes a certain pipeline stage becomes the bottleneck. This stage corresponds to the row with the maximum number of checkmarks in the reservation table. This bottleneck problem can be alleviated by using multiple copies of the same stage simultaneously. This leads to the use of multiple execution units in a pipelined processor design (Fig. 6.12). Fig. 6.12 A pipelined processor with multiple functional units and distributed reservation stations supported by tagging (Courtesy of G. Sohi; reprinted with permission from IEEE Transactions on Computers, March 1990) Sohi (1990) used a model architecture for a pipelined scalar processor containing multiple functional units (Fig. 6.12). In order to resolve data or resource dependences among the successive instructions entering the pipeline, the reservation stations (RS) are used with each functional unit. Operations wait in the RS until their data dependences have been resolved. Each RS is uniquely identified by a tag, which is monitored by a tag unit. The tag unit keeps checking the tags from all currently used registers or RSs. This register tagging technique allows the hardware to resolve conflicts between source and destination registers assigned for multiple instructions. Besides resolving conflicts, the RSs also serve as buffers to interface the pipelined functional units with the decode and issue units. The multiple functional units operate in parallel, once the dependences are resolved. This alleviates the bottleneck in the execution stages of the instruction pipeline. **Internal Data Forwarding** The throughput of a pipelined processor can be further improved with internal data forwarding among multiple functional units. In some cases, some memory-access operations can be replaced by register transfer operations. The idea is described in Fig. 6.13. A store-load forwarding is shown in Fig. 6.13a in which the load operation (LD R2, M) from memory to register R2 can be replaced by the move operation (MOVE R2, R1) from register R1 to register R2. Since register transfer is faster than memory access, this data forwarding will reduce memory traffic and thus results in a shorter execution time. Similarly, load-load forwarding (Fig. 6.13b) eliminates the second load operation (LD R2, M) and replaces it with the move operation (MOVE R2, R1). Further discussion on operand forwarding will be continued in Chapter 12. Fig. 6.13 Internal data forwarding by replacing memory-access operations with register gransfer operations # Example 6.5 Implementing the dot-product operation with internal data forwarding between a multiply unit and an add unit One can feed the output of a multiplier directly to the input of an adder (Fig. 6.14) for implementing the following dot-product operation: $$s = \sum_{i=1}^{n} a_i \times b_i \tag{6.10}$$ Without internal data forwarding between the two functional units, the three instructions must be sequentially executed in a looping structure (Fig. 6.14a). With data forwarding, the output of the multiplier is fed directly into the input register R4 of the adder (Fig. 6.14b). At the same time, the output of the multiplier is also routed to register R3. Internal data forwarding between the two functional units thus reduces the total execution time through the pipelined processor. **Hazard Avoidance** The *read* and *write* of shared variables by different instructions in a pipeline may lead to different results if these instructions are executed out of order. As illustrated in Fig. 6.15, three types of logic *hazards* are possible. Consider two instructions I and J. Instruction J is assumed to logically follow instruction I according to program order. If the actual execution order of these two instructions violates the program order, incorrect results may be read or written, thereby producing hazards. Hazards should be prevented before these instructions enter the pipeline, such as by holding instruction J until the dependence on instruction I is resolved. We use the notation D(I) and R(I) for the *domain* and *range* of an instruction I. The domain contains the *input set* (such as operands in registers or in memory) to be used by instruction I. The range corresponds to the *output set* of instruction I. Listed below are conditions under which possible hazards can occur: Fig. 6.14 Internal data forwarding for implementing the dot-product operation Fig. 6.15 Possible hazards between read and write operations in an instruction pipeline (instruction I is ahead of instruction J in program order) $$R(I) \cap D(J) \neq \phi$$ for RAW hazard $R(I) \cap R(J) \neq \phi$ for WAW hazard $D(I) \cap R(J) \neq \phi$ for WAR hazard $$(6.11)$$ These conditions are necessary but not sufficient. This means the hazard may not appear even if one or more of the conditions exist. The RAW hazard corresponds to the flow dependence, WAR to the antidependence, and WAW to the output dependence introduced in Section 2.1. The occurrence of a logic hazard depends on the order in which the two instructions are executed. Chapter 12 discusses techniques to handle such hazards. ## 6.3.3 Dynamic Instruction Scheduling In this section, we describe three methods for scheduling instructions through an instruction pipeline. The *static scheduling* scheme is supported by an optimizing compiler. *Dynamic scheduling* is achieved using a technique such as Tomasulo's *register-tagging* scheme built in the IBM 360/91, or the *scoreboarding* scheme built in the CDC 6600 processor. **Static Scheduling** Data dependences in a sequence of instructions create interlocked relationships among them. Interlocking can be resolved through a compiler-based static scheduling approach. A compiler or a postprocessor can be used to increase the separation between interlocked instructions. Consider the execution of the following code fragment. The *multiply* instruction cannot be initiated until the preceding *load* is complete. This data dependence will stall the pipeline for three clock cycles since the two *loads* overlap by one cycle. | Stage delay: | Instruc | tion: | | |--------------|----------|-------------------|------------------------------------| | 2 cycles | Add | R0, R1 | $/R0 \leftarrow (R0) + (R1)/$ | | 1 cycle | Move | R1, R5 | /R1 ← (R5)/ | | 2 cycles | Load | R2, M( $\alpha$ ) | $/R2 \leftarrow (Memory(\alpha))/$ | | 2 cycles | Load | R3, M(β) | $/R3 \leftarrow (Memory(\beta))/$ | | 3 cycles | Multiply | R2, R3 | $/R2 \leftarrow (R2) \times (R3)/$ | The two *loads*, since they are independent of the *add* and *move*, can be moved ahead to increase the spacing between them and the *multiply* instruction. The following program is obtained after this modification: | Load | R2, M(α) | 2 to 3 cycles | |----------|-----------|-----------------------------| | Load | R3, M (β) | 2 cycles due to overlapping | | Add | R0, R1 | 2 cycles | | Move | R1, R5 | 1 cycle | | Multiply | R2, R3 | 3 cycles | Through this code rearrangement, the data dependences and program semantics are preserved, and the *multiply* can be initiated without delay. While the operands are being loaded from memory cells $\alpha$ and $\beta$ into registers R2 and R3, the two instructions *add* and *move* consume three cycles and thus pipeline stalling is avoided. 248 **Tomasulo's Algorithm** This hardware dependence-resolution scheme was first implemented with multiple floating-point units of the IBM 360/91 processor. The hardware platform is abstracted in Fig. 6.12. For the Model 91 processor, three RSs were used in a floating-point adder and two pairs in a floating-point multiplier. The scheme resolved resource conflicts as well as data dependences using register tagging to allocate or deallocate the source and destination registers. An issued instruction whose operands are not available is forwarded to an RS associated with the functional unit it will use. It waits until its data dependences have been resolved and its operands become available. The dependence is resolved by monitoring the result bus (called *common data bus* in Model 91). When all operands for an instruction are available, it is dispatched to the functional unit for execution. All working registers are tagged. If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to an RS. When the register data becomes available, it also reaches the RS which has the same tag. # Example 6.6 Tomasulo's algorithm for dynamic instruction scheduling Tomasulo's algorithm was applied to work with processors having a few floating-point registers. In the case of Model 91, only four registers were available. Figure 6.16a shows a minimum-register machine code for computing X = Y + Z and $A = B \times C$ . The pipeline timing with Tomasulo's algorithm appears in Fig. 6.16b. Here, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 by ignoring the pipeline startup and draining times. Memory is treated as a special functional unit. When an instruction has completed execution, the result (along with its tag) appears on the result bus. The registers as well as the RSs monitor the result bus and update their contents (and ready/busy bits) when a matching tag is found. Details of the algorithm can be found in the original paper by Tomasulo (1967). Fig. 6.16 Dynamic instruction scheduling using Tomasulo's algorithm on the processor in Fig. 6.12 (Courtesy of James Smith; reprinted with permission from *IEEE Computer*, July 1989) **CDC Scoreboarding** The CDC 6600 was an early high-performance computer that used dynamic instruction scheduling hardware. Figure 6.17a shows a CDC 6600-like processor, in which multiple functional units appeared as multiple execution pipelines. Parallel units allowed instructions to complete out of the original program order. The processor had instruction buffers for each execution unit. Instructions were issued to available functional units regardless of whether register input data was available. Fig. 6.17 Hardware scoreboarding for dynamic instruction scheduling (Courtesy of James Smith: reprinted with permission from *IEEE Computer*, July 1989) The instruction would then wait in a buffer for its data to be produced by other instructions. To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the *scoreboard*. This unit kept track of the registers needed by instructions waiting for the various functional units. When all registers had valid data, the scoreboard enabled the instruction execution. Similarly, when a functional unit finished, it signaled the scoreboard to release the resources. # Example 6.7 Pipelined operations using hardware scoreboarding on the CDC 6600-like processor (James Smith, 1989) Figure 6.17b shows the pipeline schedule based on scoreboard issue logic. The schedule corresponds to the execution of the same machine code for X = Y + Z and $A = B \times C$ . The pipeline latencies are the same as those resulting from Tomasulo's algorithm. The *add* instruction is issued to its functional unit before its registers are ready. It then waits for its input register operands. The scoreboard routes the register values to the adder unit when they become available. In the meantime, the issue stage is not blocked, so other instructions can bypass the blocked add. It takes 13 clock cycles to perform the operations. Details of the CDC scoreboarding can be found in the book by Thornton (1970). The scoreboard is a centralized control logic which keeps track of the status of registers and multiple functional units. When functional units generate new results, some data dependences can be resolved and thus a higher degree of parallelism can be explored with scoreboarding. Scoreboarding in latter microprocessors like MC88000 used forwarding logic and register tagging. In a way, scoreboarding implements a kind of data-driven mechanism to achieve efficient computations. Dynamic instruction scheduling was implemented only in high-end mainframes or supercomputers in the past. Most microprocessors used static scheduling. But the trend has changed over the last two decades. RISC and superscalar processors are today built with hardware support of dynamic scheduling at runtime. Significant trace-driven data are needed to optimize the pipelined processor design. Toward this goal, processor and compiler designers have to work together to achieve an efficient design. Multiple-issue instruction pipelines, which are much more complicated than single-issue instruction pipelines, will be studied in Section 6.5. #### 6.3.4 Branch Handling Techniques The performance of pipelined processors is limited by data dependences and branch instructions. In previous sections, we have studied the effects of data dependence. In this subsection, we study the effects of branching. Various branching strategies are reviewed. The evaluation of branching strategies can be performed either on specific pipeline architecture using trace data, or by applying analytic models. We provide below a simple performance analysis. For a more detailed treatment of the subject, readers are referred to the book *Branch Strategy Taxonomy and Performance Models* by Harvey Cragon (1992). **Effect of Branching** Three basic terms are introduced below for the analysis of branching effects: The action of fetching a nonsequential or remote instruction after a branch instruction is called *branch taken*. The instruction to be executed after a branch taken is called a *branch target*. The number of pipeline cycles wasted between a branch taken and the fetching of its branch target is called the *delay slot*, denoted by b. In general, $0 \le b \le k-1$ , where k is the number of pipeline stages. When a branch is taken, all the instructions following the branch in the pipeline become useless and will be drained from the pipeline. This implies that a branch taken causes the pipeline to be flushed, losing a number of useful cycles. These terms are illustrated in Fig. 6.18, where a branch taken causes $I_{b+1}$ through $I_{b+k-1}$ to be drained from the pipeline. Let p be the probability of a conditional branch instruction in a typical instruction stream and q the probability of a successfully executed conditional branch instruction (a branch taken). Typical values of p = 20% and q = 60% have been observed in some programs. The penalty paid by branching is equal to $pqnb\tau$ because each branch taken costs $b\tau$ extra pipeline cycles. Based on Eq. 6.4, we thus obtain the total execution time of n instructions, including the effect of branching, as follows: $$T_{eff} = k\tau + (n-1)\tau + pqnb\tau$$ Modifying Eq. 6.9, we define the following effective pipeline throughput with the influence of branching: $$H_{eff} = \frac{n}{T_{eff}} = \frac{nf}{k + n - 1 + pqnb} \tag{6.12}$$ When $n \to \infty$ , the tightest upper bound on the effective pipeline throughput is obtained when b = k - 1: $$H_{eff}^* = \frac{f}{pq(k-1)+1} \tag{6.13}$$ When p = q = 0 (no branching), the above bound approaches the maximum throughput $f = 1/\tau$ , same as in Eq. 6.2. Suppose p = 0.2, q = 0.6, and b = k - 1 = 7. We define the following performance degradation factor: $$D = \frac{f - H_{efff}^*}{f} = 1 - \frac{1}{pq(k-1)+1} = \frac{pq(k-1)}{pq(k-1)+1} = \frac{0.84}{1.84} = 0.46 \quad (6.14)$$ The above analysis implies that the pipeline performance can be degraded by 46% with branching when the instruction stream is sufficiently long. This analysis demonstrates the high degree of performance degradation caused by branching in an instruction pipeline. Fig. 6.18 The decision of a branch taken at the last stage of an instruction pipeline causes $b \le k-1$ previously loaded instructions to be drained from the pipeline **Branch Prediction** Branch can be predicted either based on branch code types statically or based on branch history during program execution. The probability of branch with respect to a particular branch instruction type can be used to predict branch. This requires collecting the frequency and probabilities of branch taken and branch types across a large number of program traces. Such a *static branch strategy* may not be very accurate. The static prediction direction (taken or not taken) can even be wired into the processor. According to past experience, the best performance is given by predicting taken. This results from the fact that most conditional branch instructions are taken in program execution. The wired-in static prediction cannot be changed once committed to the hardware. However, the scheme can be modified to allow the compiler to select the direction of each branch on a semi-static prediction basis. A dynamic branch strategy works better because it uses recent branch history to predict whether or not the branch will be taken next time when it occurs. To be accurate, one may need to use the entire history of the branch to predict the future choice. This is infeasible to implement. Therefore, most dynamic prediction is determined with limited recent history, as illustrated in Fig. 6.19. Fig. 6.19 Branch history buffer and a state transition diagram used in dynamic branch prediction (Courtesy of Lee and Smith, IEEE Computer, 1984) Cragon (1992) classified dynamic branch strategies into three major classes: One class predicts the branch direction based upon information found at the decode stage. The second class uses a cache to store target addresses at the stage the effective address of the branch target is computed. The third scheme uses a cache to store target instructions at the fetch stage. Dynamic prediction demands additional hardware to keep track of the past behavior of the branch instructions at run time. The amount of history recorded should be relatively small. Otherwise, the prediction logic becomes too costly to implement. Lee and Smith (1984) suggested the use of a branch target buffer (BTB) to implement branch prediction (Fig. 6.19a). The BTB is used to hold recent branch information including the address of the branch target used. The address of the branch instruction locates its entry in the BTB. A state transition diagram (Fig. 6.19b) has been used by Lee and Smith for tracking the last two outcomes at each branch instruction in a given program. The BTB entry contains the information which will guide the prediction. Prediction information is updated upon completion of the current branch. The BTB can be extended to store not only the branch target address but also the target instruction itself, in order to allow zero delay in converting conditional branches to unconditional branches. The *taken* (T) and *not-taken* (N) labels in the state diagram correspond to actual program behavior. Further discussion on this topic will be found in Chapter 12. **Delayed Branches** Examining the branch penalty, we realize that the branch penalty would be reduced significantly if the delay slot could be shortened or minimized to a zero penalty. The purpose of delayed branches is to make this possible, as illustrated in Fig. 6.20. The idea was originally used to reduce the branching penalty in coding microinstructions. A *delayed* branch of d cycles allows at most d-1 useful instructions to be executed following the branch taken. The execution of these instructions should be independent of the outcome of the branch instruction. Otherwise, a zero branching penalty cannot be achieved. (b) A delayed branch for 3 cycles when the branch condition is resolved at the execute stage (c) A delayed branch for 4 cycles when the branch condition is resolved at the store stage Fig. 6.20 The concept of delayed branch by moving independent instructions or NOP fillers into the delay slot of a four-stage pipeline The technique is similar to that used for software interlocking. NOPs can be used as fillers if needed. The probability of moving one instruction (d = 2 in Fig. 6.20a) into the delay slot is greater than 0.6, that of moving two instructions (d = 3 in Fig. 6.20b) is about 0.2, and that of moving three instructions (d = 4 in Fig. 6.20c) is less than 0.1, according to some program trace results. # Example 6.8 A delayed branch with code motion into a delay slot Code motion across branches can be used to achieve a delayed branch, as illustrated in Fig. 6.21. Consider the execution of a code fragment in Fig. 6.21a. The original program is modified by moving the useful instruction I1 into the delay slot after the branch instruction I3. Fig. 6.21 Code motion across a branch to achieve a delayed branch with a reduced penalty to pipeline performance In case the branch is not taken, the execution of the modified program produces the same results as the original program. In case the branch is taken in the modified program, execution of the delayed instructions I1 and I5 is needed anyway. In general, data dependence between instructions moving across the branch and the remaining instructions being scheduled must be analyzed. Since instruction I1 is independent of the remaining instructions, leaving it in the delay slot will not create logic hazards or data dependences. Sometimes NOP fillers can be inserted in the delay slot if no useful instructions can be found. However, inserting NOP fillers does not save any cycles in the delayed branch operation. From the above analysis, one can conclude that delayed branching may be more effective in short instruction pipelines with about four stages. Delayed branching has been built into some RISC processors, including the MIPS R4000 and Motorola MC88110. ## **ARITHMETIC PIPELINE DESIGN** Pipelining techniques can be applied to speed up numerical arithmetic computations. We start with a review of arithmetic principles and standards. Then we consider arithmetic pipelines with fixed functions. A fixed-point multiply pipeline design and the MC68040 floating-point unit are used as examples to illustrate the design techniques involved. A multifunction arithmetic pipeline is studied with the TI-ASC arithmetic processor as an example. #### 6.4.1 Computer Arithmetic Principles In a digital computer, arithmetic is performed with *finite precision* due to the use of fixed-size memory words or registers. Fixed-point or integer arithmetic offers a fixed range of numbers that can be operated upon. Floating-point arithmetic operates over a much increased dynamic range of numbers. In modern processors, fixed-point and floating-point arithmetic operations are very often performed by separate hardware on the same processor chip. Finite precision implies that numbers exceeding the limit must be truncated or rounded to provide a precision within the number of significant bits allowed. In the case of floating-point numbers, exceeding the exponent range means error conditions, called *overflow* or *underflow*. The Institute of Electrical and Electronics Engineers (IEEE) has developed standard formats for 32- and 64-bit floating numbers known as the *IEEE 754 Standard*. This standard has been adopted for most of today's computers. **Fixed-Point Operations** Fixed-point numbers are represented internally in machines in *sign-magnitude*, one's complement, or two's complement notation. Most computers use the two's complement notation because of its unique representation of all numbers (including zero). One's complement notation introduces a second zero representation called *dirty zero*. Add, subtract, multiply, and divide are the four primitive arithmetic operations. For fixed-point numbers, the add or subtract of two *n*-bit integers (or fractions) produces an *n*-bit result with at most one carry-out. The multiplication of two n-bit numbers produces a 2n-bit result which requires the use of two memory words or two registers to hold the full-precision result. The division of an n-bit number by another may create an arbitrarily long quotient and a remainder. Only an approximate result is expected in fixed-point division with rounding or truncation. However, one can expand the precision by using a 2n-bit dividend and an n-bit divisor to yield an n-bit quotient. **Floating-Point Numbers** A floating-point number X is represented by a pair (m, e), where m is the mantissa (or fraction) and e is the exponent with an implied base (or radix). The algebraic value is represented as $X = m \times r^e$ . The sign of X can be embedded in the mantissa. # Example 6.9 The IEEE 754 floating-point standard A 32-bit floating-point number is specified in the IEEE 754 Standard as follows: A binary base is assumed with r = 2. The 8-bit exponent e field uses an excess-127 code. The dynamic range of e is (-127, 128), internally represented as (0, 255). The sign s and the 23-bit mantissa field m form a 25-bit sign-magnitude fraction, including an implicit or "hidden" 1 bit to the left of the binary point. Thus the complete mantissa actually represents the value 1.m in binary. This hidden bit is not stored with the number. If 0 < e < 255, then a nonzero normalized number represents the following algebraic value: $$X = (-1)^{s} \times 2^{e-127} \times (1.m) \tag{6.15}$$ When e = 255 and $m \neq 0$ , a not-a-number (NaN) is represented. NaNs can be caused by dividing a zero by a zero or taking the square root of a negative number, among many other nondeterminate cases. When e = 255 and m = 0, an infinite number $X = (-1)^s \infty$ is represented. Note that $+\infty$ and $-\infty$ are represented differently. When e = 0 and $m \ne 0$ , the number represented is $X = (-1)^s 2^{-126}(0.m)$ . When e = 0 and m = 0, a zero is represented as $X = (-1)^s 0$ . Again, +0 and -0 are possible. The 64-bit (double-precision) floating point can be defined similarly using an excess-1023 code in the exponent field and a 52-bit mantissa field. A number which is nonzero, finite, non-NaN, and normalized, has the following value: $$X = (-1)^{s} \times 2^{e-1023} \times (1.m)$$ (6.16) Special rules are given in the standard to handle overflow or underflow conditions. Interested readers may check the published IEEE standards for details. **Floating-Point Operations** The four primitive arithmetic operations are defined below for a pair of floating-point numbers represented by $X = (m_x, e_x)$ and $Y = (m_y, e_y)$ . For clarity, we assume $e_x \le e_y$ and base r = 2. $$X + Y = (m_x \times 2^{e_x - e_y} + m_y) \times x^{e_y}$$ (6.17) $$X - Y = (m_x \times 2^{e_x - e_y} - m_y) \times x^{e_y}$$ (6.18) $$X \times Y = (m_x \times m_y) \times 2^{e_x + e_y} \tag{6.19}$$ $$X + Y = (m_x + m_y) \times 2^{e_x - e_y}$$ (6.20) The above equations clearly identify the number of arithmetic operations involved in each floating-point function. These operations can be divided into two halves: One half is for exponent operations such as comparing their relative magnitudes or adding/subtracting them; the other half is for mantissa operations, including four types of fixed-point operations. Floating-point units are ideal for pipelined implementation. The two halves of the operations demand almost twice as much hardware as that required in a fixed-point unit. Arithmetic shifting operations are needed for equalizing the two exponents before their mantissas can be added or subtracted. Shifting a binary fraction m to the right k places corresponds to the weighting $m \times 2^{-k}$ , and shifting k places to the left corresponds to $m \times 2^k$ . In addition, normalization of a floating-point number also requires left shifts to be performed. **Elementary Functions** Elementary functions include trigonometric, exponential, logarithmic, and other transcendental functions. Truncated polynomials or power series can be used to evaluate the elementary functions, such as $\sin x$ , $\ln x$ , $e^x$ , $\cosh x$ , $\tan^{-1} y$ , $\sqrt{x}$ , $x^3$ , etc. Interested readers may refer to the book by Hwang (1979) for details of computer arithmetic functions and their hardware implementation. It should be noted that computer arithmetic can be implemented by hardwired logic circuitry as well as by table lookup using fast memory. Frequently used constants and special function values can also be generated by table lookup. # 6.4.2 Static Arithmetic Pipelines Most of today's arithmetic pipelines are designed to perform fixed functions. These arithmetic/logic units (ALUs) perform fixed-point and floating-point operations separately. The fixed-point unit is also called the integer unit. The floating-point unit can be built either as part of the central processor or on a separate coprocessor. These arithmetic units perform scalar operations involving one pair of operands at a time. The pipelining in scalar arithmetic pipelines is controlled by software loops. Vector arithmetic units can be designed with pipeline hardware directly under firmware or hardwired control. Scalar and vector arithmetic pipelines differ mainly in the areas of register files and control mechanisms involved. Vector hardware pipelines are often built as add-on options to a scalar processor or as an attached processor driven by a control processor. Both scalar and vector processors are used in modern supercomputers. Arithmetic Pipeline Stages Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic. Since all arithmetic operations (such as add, subtract, multiply, divide, squaring, square rooting, logarithm, etc.) can be implemented with the basic add and shifting operations, the core arithmetic stages require some form of hardware to add and to shift. For example, a typical three-stage floating-point adder includes a first stage for exponent comparison and equalization which is implemented with an integer adder and some shifting logic; a second stage for fraction addition using a high-speed carry lookahead adder; and a third stage for fraction normalization and exponent readjustment using a shifter and another addition logic. Arithmetic or logical shifts can be easily implemented with *shift registers*. High-speed addition requires either the use of a *carry-propagation adder* (CPA) which adds two numbers and produces an arithmetic sum as shown in Fig. 6.22a, or the use of a *carry-save adder* (CSA) to "add" three input numbers and produce one sum output and a carry output as exemplified in Fig. 6.22b. In a CPA, the carries generated in successive digits are allowed to propagate from the low end to the high end, using either ripple carry propagation or some carry looka-head technique. In a CSA, the carries are not allowed to propagate but instead are saved in a carry vector. In general, an n-bit CSA is specified as follows: Let X, Y, and Z be three n-bit input numbers, expressed as $X = (x_{n-1}, x_{n-2}, ..., x_1, x_0)$ and so on. The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers, denoted as $S^b = (0, S_{n-1}, S_{n-2}, ..., S_1, S_0)$ and $C = (C_n, C_{n-1}, ..., C_1, 0)$ . Note that the leading bit of the *bitwise sum* $S^b$ is always a 0, and the tail bit of the *carry vector* C is always a 0. The input-output relationships are expressed below: $$S_i = x_i \oplus y_i \oplus z_i$$ $$C_{i+1} = x_i y_i \vee y_{z_i} \vee z_i x_i$$ (6.21) (a) An *n*-bit carry-propagate adder (CPA) which allows either carry propagation or applies the carry-lookahead technique (b) An n-bit carry-save adder (CSA), where S<sup>b</sup> is the bitwise sum of X, Y, and Z, and C is a carry vector generated without carry propagation between digits Fig. 6.22 Distinction between a carry-propagate adder (CPA) and a carry-save adder (CSA) for i = 0, 1, 2, ..., n - 1, where $\oplus$ is the exclusive OR and $\vee$ is the logical OR operation. Note that the arithmetic sum of three input numbers, i.e., S = X + Y + Z, is obtained by adding the two output numbers, i.e., $S = S^b + C$ , using a CPA. We use the CPA and CSAs to implement the pipeline stages of a fixed-point multiply unit as follows. **Multiply Pipeline Design** Consider as an example the multiplication of two 8-bit integers $A \times B = P$ , where P is the 16-bit product. This fixed-point multiplication can be written as the summation of eight partial products as shown below: $P = A \times B = P_0 + P_1 + P_2 + \cdots + P_7$ , where $\times$ and + are arithmetic multiply and add operations, respectively. | • | | - | | | | | | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | = | A | |----|---|---|---|---|---|---|----|---|---|---|---|---|---|---|---|---|----------------| | | | | | | | | x) | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | = | В | | | | | | | | | | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | = | $P_0$ | | | | | | | | | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | = | $P_1$ | | | | | | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_2$ | | | | | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_3$ | | | | | | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | = | $P_4$ | | | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_5$ | | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | $P_6$ | | +) | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | = | P <sub>7</sub> | | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | = | P | Note that the partial product $P_j$ is obtained by multiplying the multiplicand A by the jth bit of B and then shifting the result j bits to the left for j = 0, 1, 2, ..., 7. Thus $P_j$ is (8 + j) bits long with j trailing zeros. The summation of the eight partial products is done with a *Wallace tree* of CSAs plus a CPA at the final stage, as shown in Fig. 6.23. The first stage $(S_1)$ generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously. The second stage $(S_2)$ is made up of two levels of four CSAs, and it essentially merges eight numbers into four numbers ranging from 13 to 15 bits. The third stage $(S_3)$ consists of two CSAs, and it merges four numbers from $S_2$ into two 16-bit numbers. The final stage $(S_4)$ is a CPA, which adds up the last two numbers to produce the final product P. Fig. 6.23 A pipeline unit for fixed-point multiplication of 8-bit integers (The number along each line indicates the line width.) For a maximum width of 16 bits, the CPA is estimated to need four gate levels of delay. Each level of the CSA can be implemented with a two-gate-level logic. The delay of the first stage $(S_1)$ also involves two gate levels. Thus all the pipeline stages have an approximately equal amount of delay. 260 The matching of stage delays is crucial to the determination of the number of pipeline stages, as well as the clock period (Eq. 6.1). If the delay of the CPA stage can be further reduced to match that of a single CSA level, then the pipeline can be divided into six stages with a clock rate twice as fast. The basic concepts can be extended to operands with a larger number of bits, as we see in the example below. # Example 6.10 The floating-point unit in the Motorola MC68040 Figure 6.24 shows the design of a pipelined floating-point unit built as an on-chip feature in the Motorola M68040 processor. Fig. 6.24 Pipelined floating-point unit of the Motorola MC68040 processor (Courtesy of Motorola, Inc., 1992) This arithmetic pipeline has three stages. The mantissa section and exponent section are essentially two separate pipelines. The mantissa section can perform floating-point add or multiply operations, either single-precision (32 bits) or double-precision (64 bits). In the mantissa section, stage 1 receives input operands and returns with computation results; 64-bit registers are used in this stage. Note that all three stages are connected to two 64-bit data buses. Stage 2 contains the array multiplier $(64 \times 8)$ which must be repeatedly used to carry out a long multiplication of the two mantissas. The 67-bit adder performs the addition/subtraction of two mantissas, the barrel shifter is used for normalization. Stage 3 contains registers for holding results before they are loaded into the register file in stage 1 for subsequent use by other instructions. On the exponent side, a 16-bit bus is used between stages. Stage 1 has an exponent adder for comparing the relative magnitude of two exponents. The result of stage 1 is used to equalize the exponents before mantissa addition can be performed. Therefore, a shift count (from the output of the exponent adder) is sent to the barrel shifter for mantissa alignment. After normalization of the final result (getting rid of leading zeros), the exponent needs to be readjusted in stage 3 using another adder. The final value of the resulting exponent is fed from the register in stage 3 to the register file in stage 1, ready for subsequent usage. **Convergence Division** One technique for division involves repeated multiplications. Mantissa division is carried out by a *convergence method*. This convergence division obtains the quotient Q = M/D of two normalized fractions $0.5 \le M < D < 1$ in two's complement notation by performing two sequences of chain multiplications as follows: $$Q = \frac{M \times R_1 \times R_2 \times \dots \times R_k}{D \times R_1 \times R_2 \times \dots \times R_k}$$ (6.22) where the successive multipliers $$R_i = 1 + \delta^{2^{i-1}} = 2 - D^{(i)}$$ for $i = 1, 2, ..., k$ and $D = 1 - \delta$ The purpose is to choose $R_i$ such that the denominator $D^{(k)} = D \times R_1 \times R_2 \times \cdots \times R_k \to 1$ for a sufficient number of k iterations, and then the resulting numerator $M \times R_1 \times R_2 \times \cdots \times R_k \to Q$ . Note that the multiplier $R_i$ can be obtained by finding the two's complement of the previous chain product $D^{(i)} = D \times R_1 \times \cdots \times R_{i-1} = 1 - \delta^{2i-1}$ because $2 - D^{(i)} = R_i$ . The reason why $D^{(k)} \to 1$ for large k is that $$D^{(i)} = (1 - \delta)(1 + \delta)(1 + \delta^{2})(1 + \delta^{4}) \cdots (1 + \delta^{2^{i-1}})$$ $$= (1 - \delta^{2})(1 + \delta^{2})(1 + \delta^{4}) \cdots (1 + \delta^{2^{i-1}})$$ $$= (1 - \delta^{2^{i}}) \text{ for } i = 1, 2, ..., k$$ (6.23) Since $0 < \delta = 1 - D \le 0.5$ , $\delta^{2^i} \to 0$ as *i* becomes sufficiently large, say, i = k for some *k*; thus $D^{(k)} = 1 - \delta^{2^k} = 1$ for large *k*. The end result is $$Q = M \times (1 + \delta) \times (1 + \delta^2) \times \dots \times (1 + \delta^{2^{k-1}})$$ (6.24) The above two sequences of chain multiplications are carried out alternately between the numerator and denominator through the pipeline stages. To summarize, in this technique division is carried out by repeated multiplications. Thus divide and multiply can share the same hardware pipeline. # Example 6.11 The IBM 360/Model 91 floating-point unit design In the history of building scientific computers, IBM 360 Model 91 was certainly a milestone. Many of the pipeline design features introduced in previous sections were implemented in this machine. Therefore, it is worth the effort to examine the architecture of Model 91. In particular, we describe how floating-point add and multiply/divide operations were implemented in this machine. As shown in Fig. 6.25, the floating-point execution unit in Model 91 consisted of two separate functional pipelines: the *add unit* and the *multiply/divide unit*, which could be used concurrently. The former was a two-stage pipeline, and the latter was a six-stage pipeline. Fig. 6.25 The IBM 360 Model 91 floating-point unit (Courtesy of IBM Corporation, 1967) The floating-point operation stack was a kind of prefetch buffer holding eight floating-point instructions for subsequent execution through the two functional pipelines. The floating-point buffers were used to input operands. Operands may also come from the floating-point registers which were connected via the common data bus to the output bus. Results from the two functional units could be sent back to the memory via the store data buffers, or they could be routed back to the FLR or to the reservation stations at the input ends. The add unit allowed three pairs of operands to be loaded into three reservation stations. Only one pair could be used at a time. The other two pairs held operands for subsequent use. The use of these reservation stations made the add unit behave like three virtual functional units. Similarly, the two pairs at the input end of the multiply/divide unit made it behave like two virtual units. Internal data forwarding in Model 91 was accomplished using source tags on all registers and reservation stations. Divide was implemented in Model 91 based on the convergence method. Every source of an input operand was uniquely identified with a 4-bit tag. Every destination of an input operand had an associated tag register that held the tag naming the source of data if the destination was busy. Through this register tagging technique, operands/results could be directly passed among the virtual functional units. This forwarding significantly cut down the data flow time between them. Dynamic scheduling logic was built into Model 91 using Tomasulo's algorithm to resolve the data dependence problem. Either the add unit or the multiply/divide unit could execute an operation using operands from one of the reservation stations. Under Tomasulo's algorithm, data dependences are preserved by copying source tags when the sources are busy. When data is generated by a source, it passes its identification and the data onto the common data bus. Awaiting destinations continuously monitor the bus in a tag watch. When the source tag matches, the destination takes in the data from the bus. Other variations of Tomasulo's algorithm can be made to store the source tags within the destinations, to use a special tag (such as 0000) to indicate nonbusy register/buffers, or to use direct-mapped tags to avoid associative hardware. Besides the IBM 360/370, the CDC 6600/7600 also implemented convergence division. It took two pipeline cycles to perform the floating-point add, six cycles to multiply, and 18 cycles to divide in the IBM System/360 Model 91 due to five iterations involved in the convergence division process. ## 6.4.3 Multifunctional Arithmetic Pipelines Static arithmetic pipelines are designed to perform a fixed function and are thus called *unifunctional*. When a pipeline can perform more than one function, it is called *multifunctional*. A multifunctional pipeline can be either *static* or *dynamic*. Static pipelines perform *one* function at a time, but different functions can be performed at different times. A dynamic pipeline allows several functions to be performed simultaneously through the pipeline, as long as there are no conflicts in the shared usage of pipeline stages. In this section, we study a static multifunctional pipeline which was designed into the TI Advanced Scientific Computer (ASC). # Example 6.12 The TI/ASC arithmetic processor design There were four pipeline arithmetic units built into the TI-ASC system, as shown in Fig. 6.26. The instruction-processing unit handled the fetching and decoding of instructions. There were a large number of working registers in the processor which also controlled the operations of the memory buffer unit and of the arithmetic units. There were two sets of operand buffers, $\{X, Y, Z\}$ and $\{X', Y, Z'\}$ , in each arithmetic unit. X', X, Y' and Y were used for input operands, and Z' and Z were used to output results. Note that intermediate results could be also routed from Z-registers to either X- or Y-registers. Both processor and memory buffers accessed the main memory for instructions and operands/results, respectively. Each pipeline arithmetic unit had eight stages as shown in Fig. 6.27a. The PAU was a static multifunction pipeline which could perform only one function at a time. Figure 6.27a shows all the possible interstage connections for performing arithmetic, logical, shifting, and data conversion functions. Fig. 6.26 The architecture of the TI Advanced Scientific Computer (ASC) (Courtesy of Texas Instruments, Inc.) Both fixed-point and floating-point arithmetic functions could be performed by this pipeline. The PAU also supported vector in addition to scalar arithmetic operations. It should be noted that different functions required different pipeline stages and different interstage connection patterns. Fig. 6.27 The multiplication arithmetic pipeline of the TI Advanced Scientific Computer and the interstage connections of two representative functions (Shaded stages are unutilized) For example, fixed-point multiplication required the use of only segments $S_1$ , $S_6$ , $S_7$ , and $S_8$ as shown in Fig. 6.27b. On the other hand, the floating-point dot product function, which performs the dot product operation between two vectors, required the use of all segments with the complex connections shown in Fig. 6.27c. This dot product was implemented by essentially the following accumulated summation of a sequence of multiplications through the pipeline: $$Z \leftarrow A_i \times B_i + Z \tag{6.25}$$ where the successive operands $(A_i, B_i)$ were fed through the X- and Y-buffers, and the accumulated sums through the Z-buffer recursively. The entire pipeline could perform the multiply (×) and the add (+) in a single flow through the pipeline. The two levels of buffer registers isolated the loading and fetching of operands to or from the PAU, respectively, as in the concept of using a pair in the prefetch buffers described in Fig. 6.11. Even though the TI-ASC is no longer in production, the system provided a unique design for multifunction arithmetic pipelines. Today, most supercomputers implement arithmetic pipelines with dedicated functions for much simplified control circuitry and faster operations. ## SUPERSCALAR PIPELINE DESIGN **Pipeline Design Parameters** Some parameters used in designing the scalar base processor and superscalar processor are summarized in Table 6.1 for the pipeline processors to be studied below. All pipelines discussed are assumed to have k stages. The pipeline cycle for the scalar base processor is assumed to be 1 time unit, called the base cycle. We defined the instruction issue rate, issue latency, and simple operation latency in Section 4.1.1. The instruction-level parallelism (ILP) is the maximum number of instructions that can be simultaneously executed in the pipeline. For the base processor, all of these parameters have a value of 1. All processor types are designed relative to the base processor. The ILP is needed to fully utilize a given pipeline processor. Table 6.1 Design Parameters for Pipeline Processors | Machine type | Scalar base machine of k pipeline stages | Superscalar machine of degree n | |-----------------------------------|------------------------------------------|---------------------------------| | Machine pipeline cycle | 1 (base cycle) | 1 | | Instruction issue rate | 1 | m | | Instruction issue latency | 1 | 1 | | Simple operation latency | 1. | 1 | | ILP to fully utilize the pipeline | 1 | m | Note: All timing is relative to the base cycle for the scalar base machine. ILP: Instruction level parallelism. We study below the structure of superscalar pipelines, the data dependence problem, the factors causing pipeline stalling, and multi-instruction issuing mechanisms for achieving parallel pipelining operations. For a superscalar machine of degree m, m instructions are issued per cycle and the ILP should be m in order to fully utilize the pipeline. As a matter of fact, the scalar base processor can be considered a degenerate case of a superscalar processor of degree 1. **Superscalar Pipeline Structure** In an *m*-issue superscalar processor, the instruction decoding and execution resources are increased to form effectively *m* pipelines operating concurrently. At some pipeline stages, the functional units may be shared by multiple pipelines. This resource-shared multiple-pipeline structure is illustrated by a design example in Fig. 6.28a. In this design, the processor can issue two instructions per cycle if there is no resource conflict and no data dependence problem. There are essentially two pipelines in the design. Both pipelines have four processing stages labeled fetch, decode, execute, and store, respectively. (a) A dual-pipleline, superscalar processor with four functional units in the execution stage and a lookahead window producing out-of-order issues (b) A sample program and its dependence graph, where I2 and I3 share the adder and I4 and I6 share the multiplier Fig. 6.28 A two-issue superscalar processor and a sample program for parallel execution Each pipeline essentially has its own fetch unit, decode unit, and store unit. The two instruction streams flowing through the two pipelines are retrieved from a single source stream (the I-cache). The fan-out from a single instruction stream is subject to resource constraints and a data dependence relationship among the successive instructions. For simplicity, we assume that each pipeline stage requires one cycle, except the execute stage which may require a variable number of cycles. Four functional units, multiplier, adder, logic unit, and load unit, are available for use in the execute stage. These functional units are shared by the two pipelines on a dynamic basis. The multiplier itself has three pipeline stages, the adder has two stages, and the others each have only one stage. 268 The two store units (S1 and S2) can be dynamically used by the two pipelines, depending on availability at a particular cycle. There is a *lookahead window* with its own fetch and decoding logic. This window is used for instruction lookahead in case out-of-order instruction issue is desired to achieve better pipeline throughput. It requires complex logic to schedule multiple pipelines simultaneously, especially when the instructions are retrieved from the same source. The aim is to avoid pipeline stalling and minimize pipeline idle time. **Data Dependences** Consider the example program in Fig. 6.28b. A dependence graph is drawn to indicate the relationship among the instructions. Because the register content in R1 is loaded by I1 and then used by I2, we have flow dependence: $11 \rightarrow 12$ . Because the result in register R4 after executing I4 may affect the operand register R4 used by I3, we have antidependence: I3 $+\rightarrow$ I4. Since both I5 and I6 modify the register R6, and R6 supplies an operand for I6, we have both flow and output dependence: I5 $\rightarrow$ I6 and I5 $\leftrightarrow$ I6 as shown in the dependence graph. To schedule instructions through one or more pipelines, these data dependences must not be violated. Otherwise, erroneous results may be produced. **Pipeline Stalling** This is a problem which may seriously lower pipeline utilization. Proper scheduling avoids pipeline stalling. The problem exists in both scalar and superscalar processors. However, it is more serious in a superscalar pipeline. Stalling can be caused by data dependences or by resource conflicts among instructions already in the pipeline or about to enter the pipeline. We use an example to illustrate the conditions causing pipeline stalling. Consider the scheduling of two instruction pipelines in a two-issue superscalar processor. Figure 6.29a shows the case of no data dependence on the left and flow dependence (I1 $\rightarrow$ I2) on the right. Without data dependence, all pipeline stages are utilized without idling. With dependence, instruction 12 entering the second pipeline must wait for two cycles (shaded time slots) before entering the execution stages. This delay may also pass to the next instruction I4 entering the pipeline. In Fig. 6.29b, we show the effect of branching (instruction I2). A delay slot of four cycles results from a branch taken by I2 at cycle 5. Therefore, both pipelines must be flushed before the target instructions I3 and I4 can enter the pipelines from cycle 6. Here, delayed branch or other amending actions are not taken. In Fig. 6.29c, we show a combined problem involving both resource conflict and data dependence. Instructions I1 and I2 need to use the same functional unit, and I2 $\rightarrow$ I4 exists. The net effect is that I2 must be scheduled one cycle behind because the two pipeline stages $(e_1 \text{ and } e_2)$ of the same functional unit must be used by 11 and I2 in an overlapped fashion. For the same reason, I3 is also delayed by one cycle. Instruction I4 is delayed by two cycles due to the flow dependence on I2. The shaded boxes in all the timing charts correspond to idle stages. Superscalar Pipeline Scheduling Instruction issue and completion policies are critical to superscalar processor performance. Three scheduling policies are introduced below. When instructions are issued in program order, we call it *in-order issue*. When program order is violated, *out-of-order issue* is being practiced. Similarly, if the instructions must be completed in program order, it is called *in-order completion*. Otherwise, *out-of-order completion* may result. In-order issue is easier to implement but may not yield the optimal performance. In-order issue may result in either in-order or out-of-order completion. (a) Data dependence stalls the second pipeline in shaded cycles (b) Branch instruction 12 causes a delay slot of length 4 in both pipelines (c) Resource conflicts and data dependences cause the stalling of pipeline operations for some cycles Fig. 6.29 Dependences and resource conflicts may stall one or two pipelines in a two-issue superscalar processor Out-of-order issue usually ends up with out-of-order completion. The purpose of out-of-order issue and completion is to improve performance. These three scheduling policies are illustrated in Fig. 6.30 by execution of the example program in Fig. 6.28b on the dual-pipeline hardware in Fig. 6.28a. It is demonstrated that performance can be improved from an in-order to an out-of-order schedule. The performance is often indicated by the total execution time and the utilization rate of pipeline stages. Not all programs can be scheduled out of order. Data dependence and resource conflicts do impose constraints. In-Order Issue Figure 6.30a shows a schedule for the six instructions being issued in program order II, I2, ..., I6. Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives I2, I4, and I6 in three consecutive cycles. Due to I1 $\rightarrow$ I2, I2 has to wait one cycle to use the data loaded in by I1. I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the result of I5 before it can enter the multiplier stages. In order to maintain in-order completion, I5 is forced to wait for two cycles to come out of pipeline 1. In total, nine cycles are needed and five idle cycles (shaded boxes) are observed. 270 In Fig. 6.30b, out-of-order completion is allowed even if in-order issue is practiced. The only difference between this out-of-order schedule and the in-order schedule is that I5 is allowed to complete ahead of I3 and I4, which are totally independent of 15. The total execution time does not improve. However, the pipeline utilization rate does. Only three idle cycles are observed. Note that in Figs. 6.29a and 6.29b, we did not use the lookahead window. In order to shorten the total execution time, the window can be used to reorder the instruction issues. (a) In-order issue with in-order completion in nine cycles (b) In-order issue and out-of-order completion in nine cycles (c) Out-of-order issue and out-of-order completion in seven cycles using an instruction lookahead window in the recoding process Fig. 6.30 Instruction issue and completion policies for a superscalar processor with and without instruction lookahead support (Timing charts correspond to parallel execution of the program in Fig. 6.28) Out-of-Order Issue By using the lookahead window, instruction I5 can be decoded in advance because it is independent of all the other instructions. The six instructions are issued in three cycles as shown: I5 is fetched and decoded by the window, while I3 and I4 are decoded concurrently. It is followed by issuing I6 and I1 at cycle 2, and I2 at cycle 3. Because the issue is out of order, the completion is also out of order as shown in Fig. 6.30c. Now, the total execution time has been reduced to seven cycles with no idle stages during the execution of these six instructions. The in-order issue and completion is the simplest one to implement. It is rarely used today even in a conventional scalar processor due to some unnecessary delays in maintaining program order. However, in a multiprocessor environment, this policy is still attractive. Allowing out-of-order completion can be found in both scalar and superscalar processors. Some long-latency operations, such as loads and floating-point operations, can be hidden in out-of-order completion to achieve a better performance. Output dependence and antidependence are the two relations preventing out-of-order completion. Out-of-order issue gives the processor more freedom to exploit parallelism, and thus pipeline efficiency is enhanced. The above example clearly demonstrates the advantages of instruction lookahead and of out-of-order issue and completion as far as pipeline optimization is concerned. It should be noted that multiple-pipeline scheduling is an NP-complete problem. Optimal scheduling is very expensive to obtain. Simple data dependence checking, a small lookahead window, and scoreboarding mechanisms are needed, along with an optimizing compiler, to exploit instruction parallelism in a superscalar processor. **Motorola 88110 Architecture** The Motorola 88110 was an early superscalar RISC processor. It combined the three-chip set, one CPU (88100) chip and two cache (88200) chips, in a single-chip implementation, with additional improvements. The 88110 employed advanced techniques for exploiting instruction-level parallelism, including instruction issue, out-of-order instruction completion, speculative execution, dynamic instruction rescheduling, and two on-chip caches. The unit also supported demanding graphics and digital signal processing applications. The 88110 employed a symmetrical superscalar instruction dispatch unit which dispatched two instructions each clock cycle into an array of 10 concurrent units. It allowed out-of-order instruction completion and some out-of-order instruction issue, and branch prediction with speculative execution past branches. The instruction set of the 88110 extended that of the 88100 in integer and floating-point operations. It added a new set of capabilities to support 3-D color graphics image rendering. The 88110 had separate, independent instruction and data paths, along with split caches for instructions and data. The instruction cache was 8K-byte, 2-way set-associative with 128 sets, two blocks for each set, and 32 bytes (8 instructions) per block. The data cache resembled that of the instruction set. The 88110 employed the MESI cache coherence protocol. A write-invalidate procedure guaranteed that one processor on the bus had a modified copy of any cache block at any time. The 88110 was implemented with 1.3 million transistors in a 299-pin package and driven by a 50-MHz clock. Interested readers may refer to Diefendorff and Allen (1992) for details. **Superscalar Performance** To compare the relative performance of a superscalar processor with that of a scalar base machine, we estimate the ideal execution time of N independent instructions through the pipeline. The time required by the scalar base machine is $$T(1, 1) = k + N - 1$$ (base cycles) (6.26) The ideal execution time required by an m-issue superscalar machine is $$T(m, 1) = k + \frac{N - m}{m} \text{ (base cycles)}$$ (6.27) where k is the time required to execute the first m instructions through the m pipelines simultaneously, and the second term corresponds to the time required to execute the remaining N-m instructions, m per cycle, through m pipelines. The ideal speedup of the superscalar machine over the base machine is $$S(m,1) = \frac{T(1,1)}{T(m,1)} = \frac{N+k-1}{N/m+k-1} = \frac{m(N+k-1)}{N+m(k-1)}$$ (6.28) As $N \to \infty$ , the speedup limit $S(m, 1) \to m$ , as expected. ## Example 6.13 DEC Alpha 21064 superscalar architecture As illustrated in Fig. 6.31, this was a 64-bit superscalar processor. The design emphasized speed, multiple-instruction issue, multiprocessor applications, software migration from the VAX/VMS and MIPS/OS, and a long list of usable features. The clock rate was 150 MHz with the first chip implementation. Fig. 6.31 Architecture of the DEC Alpha 21064 processor (Courtesy of Digital Equipment Corporation) Unlike others, the Alpha architecture had thirty-two 64-bit integer registers and thirty-two 64-bit floating-point registers. The integer pipeline had 7 stages, and the floating-point pipeline had 10 stages. All Alpha instructions were 32 bits. The first Alpha implementation issued two instructions per cycle, with larger number of issues in later implementations. Pipeline timing hazards, load delay slots, and branch delay slots were all minimized by hardware support. The Alpha was designed to support fast multiprocessor interlocking and interrupts. A privileged library of software was developed to run full VMS and to run OSF/1 using different versions of the software library that mirrored many of the VAX/VMS and MIPS/OS features, respectively. This library made Alpha an attractive architecture for multiple operating systems. The processor was designed to have a 300-MIPS peak and a 150-Mflops peak at 150 MHz. ### Note 6.2 Innovation versus commercial success The relationship between innovative design ideas and the commercial success of a product is not always simple, as an idealist may believe. Most of the processors used as examples in this chapter are no longer in commercial production. Rapid advances in technology and immense pressures from the market-place are usually the two main reasons behind the introduction and the demise of newer processor models. However, the innovative design ideas introduced in a new processor often have a life longer than the processor itself, since these same ideas are often carried forward in subsequent designs of the same or other processor families. For example, IBM 360/91, Motorola 68040, Motorola 88110 and DEC Alpha 21064 were all recognized for their innovative designs when they were introduced, but they achieved different degrees of commercial success. Our aim in this book is to study the innovative ideas embodied in processor and system designs; but we must also appreciate that the commercial success of a product often depends on many other crucial factors. ## Summary Instruction pipelines in processors usually have a linear structure—the execution of each instruction progresses linearly, one stage at a time, from the first to the last pipeline stage. In theory, such a linear pipeline can be designed with synchronous or asynchronous timing mode; in practice, processor pipelines today operate in synchronous mode, i.e. with a common clock signal. We studied the timing and clocking requirements of linear pipelines, and discussed the related speedup, efficiency and throughput issues. A simple model was presented which can be used in determining the optimal number of pipeline stages, based on a trade-off between cost and throughput. Dynamic or nonlinear pipelines are designed to perform a number of different functions, by appropriate scheduling of operations on the pipeline stages. Reservation tables are used for different functions; collision free schedules and latency analysis are needed for efficient operation of nonlinear pipelines. We studied how concepts of collision vectors, state transition diagrams and greedy cycles are used to determine bounds on minimum average latency (MAL), and thereby optimum schedules in terms of MAL. For a given machine instruction set, instruction pipeline design begins with analysis of the execution phases of instructions through the processor; we used the MIPS R4000 instruction pipeline as a specific example. Processor performance can be enhanced by techniques such as prefetch buffers, multiple functional units, and data forwarding; in addition, hazard avoidance is a constant goal in pipeline design and scheduling. Dynamic instruction scheduling was discussed, with a look at both Tomasulo's algorithm and the technique of scoreboarding developed at CDC. Branches in the flow of execution of instructions have a major impact on pipeline performance, since they may result in the instruction pipeline being flushed. We used a simple model to estimate the effect of branches on processor throughput, and discussed several useful branch handling techniques such as dynamic branch prediction, branch target buffer, and delayed branch. We reviewed the standard IEEE floating point representation and the basic principles of floating point arithmetic. Principles of static and multifunctional arithmetic pipelines were studied, with specific examples of arithmetic pipeline design from Motorola 68040, IBM 360/91, and TI Advanced Scientific Computer. A superscalar pipeline is one in which multiple instructions can be issued in parallel in each clock cycle, so as to better exploit instruction level parallelism in the running program. In this process, data dependences, anti-dependences and output dependences between instructions must also be respected. We reviewed in-order versus out-of-order instruction issue, and carried out basic performance analysis of superscalar pipelines. Motorola 88110 and DEC Alpha 21064 processors were used as specific examples. #### Exercises **Problem 6.1** Consider the execution of a program of 15,00,000 instructions by a linear pipeline processor with a clock rate of 1000 MHz. Assume that the instruction pipeline has five stages and that one instruction is issued per clock cycle. The penalties due to branch instructions and out-of-sequence executions are ignored. - (a) Calculate the speedup factor in using this pipeline to execute the program as compared with the use of an equivalent nonpipelined processor with an equal amount of flowthrough delay. - (b) What are the efficiency and throughput of this pipelined processor? **Problem 6.2** Study the DEC Alpha architecture in Example 6.13, find more information on DEC Alpha on the web and then answer the following questions with reasoning: - (a) Analyze the scalability of the Alpha processor implementation in terms of superscalar degree. - (b) Analyze the scalability of an Alpha-based multiprocessor system in terms of address space and multiprocessor support. **Problem 6.3** Find the optimal number of pipeline stages $k_0$ given in Eq. 6.7 using the performance/cost ratio (PCR) given in Eq. 6.6. **Problem 6.4** Prove the lower bound and upper bound on the minimal average latency (MAL) specified in Section 6.2.3. Problem 6.5 Consider the following reservation table for a four-stage pipeline with a clock cycle $\tau = 2$ ns. | | 1 | 2 | 3 | 4 | 5 | 6 | |-----------|---|---|---|---|---|---| | S1 | X | | | | | X | | <b>S2</b> | | X | | X | | | | S3<br>S4 | | | X | | | | | <b>S4</b> | | | | X | X | | - (a) What are the forbidden latencies and the initial collision vector? - (b) Draw the state transition diagram for scheduling the pipeline. - (c) Determine the MAL associated with the shortest greedy cycle. - (d) Determine the pipeline throughput corresponding to the MAL and given $\tau$ . - (e) Determine the lower bound on the MAL for this pipeline. Have you obtained the optimal latency from the above state diagram? **Problem 6.6** You are allowed to insert one noncompute delay stage into the pipeline in Problem 6.5 to make a latency of 1 permissible in the shortest greedy cycle. The purpose is to yield a new reservation table leading to an optimal latency equal to the lower bound. - (a) Show the modified reservation table with five rows and seven columns. - (b) Draw the new state transition diagram for obtaining the optimal cycle. - (c) List all the simple cycles and greedy cycles from the state diagram. - (d) Prove that the new MAL equals the lower bound. - (e) What is the optimal throughput of this pipeline? Indicate the percentage of throughput improvement compared with that obtained in part (d) of Problem 6.5. **Problem 6.7** Consider an adder pipeline with four stages as shown below. The pipeline consists of input lines X and Y and output line Z. The pipeline has a register R at its output where the temporary result can be stored and fed back to S1 at a later point in time. The inputs X and Y are multiplexed with the outputs R and Z. - (a) Assume the elements of the vector A are fed into the pipeline through input X, one element per cycle. What is the minimum number of clock cycles required to compute the sum of an N-element vector A: $s = \sum_{l=1}^{N} A(l)$ ? In the absence of an operand, a value of 0 is input into the pipeline by default. Neglect the setup time for the pipeline. - (b) Let $\tau$ be the clock period of the pipelined adder. Consider an equivalent nonpipelined adder with a flow-through delay of $4\tau$ . Find the actual speedup $S_4(64)$ and the efficiency $\eta_4(64)$ of using the above pipeline adder for N=64. - (c) Find the maximum speedup $S_4(\infty)$ and the efficiency $\eta_4(\infty)$ when N tends to infinity. - (d) Find $N_{1/2}$ , the minimum vector length required to achieve half of the maximum speedup. **Problem 6.8** Consider the following pipeline reservation table. | | 1 | 2 | 3 | 4 | |-----------|---|---|---|---| | <b>S1</b> | X | | | Χ | | <b>S2</b> | | Х | | | | <b>S3</b> | | | Х | | - (a) What are the forbidden latencies? - (b) Draw the state transition diagram. - (c) List all the simple cycles and greedy cycles. - (d) Determine the optimal constant latency cycle and the minimal average latency. - (e) Let the pipeline clock period be $\tau = 2$ ns. Determine the throughput of this pipeline. 276 **Problem 6.9** Consider the five-stage pipelined processor specified by the following reservation table: | | 1 | 2 | 3 | 4 | 5 | 6 | |----------------|---|---|---|---|---|---| | <b>\$1</b> | Х | | | | | Χ | | S2 | | Χ | | | Х | | | S3<br>S4<br>S5 | | | × | | | | | <b>S4</b> | | | | X | | | | S <b>5</b> | | X | | | | Х | - (a) List the set of forbidden latencies and the collision vector. - (b) Draw a state transition diagram showing all possible initial sequences (cycles) without causing a collision in the pipeline. - (c) List all the simple cycles from the state diagram. - (d) Identify the greedy cycles among the simple cycles. - (e) What is the minimum average latency (MAL) of this pipeline? - (f) What is the minimum allowed constant cycle in using this pipeline? - (g) What will be the maximum throughput of this pipeline? - (h) What will be the throughput if the minimum constant cycle is used? **Problem 6.10** The following assembly code is to be executed in a three-stage pipelined processor with hazard detection and resolution in each stage. The stages are instruction fetch, operand fetch (one or more as required), and execution (including a write-back operation). Explain all possible hazards in the execution of the code. Inc R0 $$/R0 \leftarrow (R0) + 1/$$ Mul ACC, R0 $/ACC \leftarrow (ACC) \times (R0)/$ Store R1, ACC $/R1 \leftarrow (ACC)/$ Add ACC, R0 $/ACC \leftarrow (ACC) + (R0)/$ Store M,ACC $/M \leftarrow (ACC)/$ **Problem 6.11** Consider the following pipelined processor with four stages. This pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle. - (a) Specify the reservation table for this pipeline with six columns and four rows. - (b) List the set of forbidden latencies between task initiations. - (c) Draw the state diagram which shows all possible latency cycles. - (d) List all greedy cycles from the state diagram. - (e) What is the value of the minimal average latency? - (f) What is the maximal throughput of this pipeline? **Problem 6.12** Three functional pipelines $f_1$ , $f_2$ , and $f_3$ are characterized by the following reservation tables. Using these three pipelines, a composite pipeline network is formed below: $f_1$ : | | 1 | 2 | 3 | 4 | |-----------|---|---|---|---| | SI | X | | | | | <b>S2</b> | | Х | | | | S3 | | | X | Х | $f_2$ : | | 1 | 2 | 3 | 4 | |----|---|---|---|---| | T1 | X | | | Х | | T2 | | X | | | | T3 | | | X | | $f_3$ : | | 1 | 2 | 3 | 4 | |----|---|---|---|---| | U1 | X | | X | | | U2 | | | | Х | | U3 | | X | | | Each task going through this composite pipeline uses the pipeline in the following order: $f_1$ first, $f_2$ and $f_3$ next, $f_1$ again, and then the output is obtained. The dual multiplexer selects a pair of inputs, (A, B) or (X,Y), and feeds them into the input of $f_1$ . The use of the composite pipeline is described by the combined reservation table. (a) Complete the following reservation table for this composite pipeline. | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |----|---|---|---|---|---|---|---|---|---|----|----|----| | S1 | X | | | | | | | | | | | | | S2 | | | | | | | | | | X | | | | S3 | | | | X | | | | | | | | | | T1 | | | | | | | | | | | | | | T2 | | | | | | | | | | | | | | T3 | | | | | | | X | | | | | | | U1 | | | | | X | | | | ļ | | | | | U2 | | | | | | | | | | | | | | U3 | | | | | | | | | | | | | - (b) Write the forbidden list and the initial collision vector. - (c) Draw a state diagram clearly showing all latency cycles. - (d) List all simple cycles and greedy cycles. - (e) Calculate the MAL and the maximal throughput of this composite pipeline. **Problem 6.13** A nonpipelined processor X has a clock rate of 250 MHz and an average CPI (cycles per instruction) of 4. Processor Y, an improved successor of X, is designed with a five-stage linear instruction pipeline. However, due to latch delay and clock skew effects, the clock rate of Y is only 200 MHz. - (a) If a program containing 1000 instructions is executed on both processors, what is the speedup of processor Y compared with that of processor X? - (b) Calculate the MIPS rate of each processor during the execution of this particular program. **Problem 6.14** Design a binary integer multiply pipeline with five stages. The first stage is for partial product generation. The last stage is a 36-bit carry-lookahead adder. The middle three stages are made of 16 carry-save adders (CSAs) of appropriate lengths. - (a) Prepare a schematic design of the five-stage multiply pipeline. All line widths and interstage connections must be shown. - (b) Determine the maximal clock rate of the pipeline if the stage delays are $\tau_1 = \tau_2 = \tau_3 = \tau_4 = 9$ ns, $\tau_5 = 4$ ns, and the latch delay is 1 ns. - (c) What is the maximal throughput of this pipeline in terms of the number of 36-bit results generated per second? **Problem 6.15** Consider a four-stage floating-point adder with a 2-ns delay per stage which equals the pipeline clock period. - (a) Name the appropriate functions to be performed by the four stages. - (b) Find the minimum number of periods required to add 100 floating-point numbers $A_1 + A_2 + \cdots + A_{100}$ using this pipeline adder, assuming that the output Z of stage $S_4$ can be routed back to either of the two inputs X or Y of the pipeline with delays equal to a multiple of the clock period. **Problem 6.16** Consider two four-stage pipeline adders and a number of noncompute delay elements. Each delay element has a one-unit time delay. (a) Use the available adders and delays to construct a composite pipeline unit for evaluating the following expression: b(i) = a(i) + a(i-1) + a(i-2) + a(i-3) for all i = 4, 5, ..., n. The composite pipeline receives a(i) for i = 1, 2, ..., n, as the successive inputs. (b) Consider a third four-stage pipeline adder. Augment the design in part (a) with this third adder to compute the following recursive expression: x(i) = a(i) + x(i-1), for all i = 4, 5, ..., n. Note that x(i) = a(i) + x(i-1) = a(i) + x(i-1) + x(i-2)] = ... = b(i) + x(i-4), where b(i) is generated by the composite pipeline in part (a).